Given an arbitrary single-qubit operation, an important task is to efficiently decompose this
operation into an (exact or approximate) sequence of fault-tolerant quantum operations. We derive
a depth-optimal canonical form for single-qubit quantum circuits, and the corresponding rules for
exactly reducing an arbitrary single-qubit circuit to this canonical form. We focus on the
single-qubit universal {H, T} basis due to its role in fault-tolerant quantum computing, and show how our
formalism might be extended to other universal bases. We then extend our canonical representation
to the family of Solovay-Kitaev decomposition algorithms, in order to find an ε-approximation to the
single-qubit circuit in polylogarithmic time. For a given single-qubit operation, we find significantly
lower-depth ε-approximation circuits than previous state-of-the-art implementations. In addition,
the implementation of our algorithm requires significantly fewer resources, in terms of computation
memory, than previous approaches.



I. INTRODUCTION 

Quantum algorithms assume the ability to perform any
quantum operation; however, a scalable quantum computer
will likely require the compilation of an arbitrary
quantum operation into a discrete set of fault-tolerant
operations. Various methods of decomposing an arbitrary
quantum gate into a sequence of gates drawn from a universal,
discrete set are known [1], and typically require
first decomposing the operation into controlled single-qubit
unitaries [2], and then decomposing the single-qubit
unitaries into a circuit of gates from a universal
basis [3-5]. Given that the Steane code [5] and the surface
code [7] yield high error thresholds, we choose to
decompose into the basis containing the Hadamard operation
(H) and the π/8 rotation (T), {H, T}, since both
gates can be implemented fault-tolerantly in these codes.

Decomposing into a discrete gate set rarely results in
an exactly equivalent unitary; the resulting sequence is
more often an ε-approximation to the original unitary. In
both cases, it is crucial for the quantum gate decomposition
algorithm to minimize the circuit resources, such as
the circuit depth, the number of gates of a certain type,
or the number of qubits. Since the cost of implementing
a non-Clifford gate fault-tolerantly is higher than in the
case of a Clifford gate, we choose to minimize the number
of non-Clifford T gates. We call the corresponding cost
the T-count of the sequence. Our approach simultaneously
minimizes circuit depth.

The Solovay-Kitaev theorem [3] states that for any ε
and single-qubit gate U, there exists a discrete approximation
to U with precision ε using $O(\log^c(1/\epsilon))$ gates
drawn from the universal, discrete gate set, where c is
a small constant. A constructive proof of the Solovay-Kitaev
theorem was shown by Dawson et al. [5] and
gives an algorithm to find an ε-approximation in time
$O(\log^{2.71}(1/\epsilon))$. The resulting gate sequence has depth
polylogarithmic in $1/\epsilon$.

Optimizing a given cost, such as the T-count, becomes
especially important in the context of the Dawson-Nielsen
algorithm [8]. The algorithm begins with a base
approximation and then proceeds recursively, resulting
in a circuit composed of $O(5^n)$ base circuits, where n is
the recursion depth. The precision of the resulting circuit
heavily depends on the precision of the base "0-level"
circuits; if a base circuit has suboptimal cost, then this
inefficiency is amplified upon composition. In addition,
the cost of a composition is often smaller than the sum of
the costs of the factors (subadditive); a resulting circuit
can often be compressed into a circuit with lower cost,
even if the constituent factors are already optimal.

One technique for finding a better base circuit is given
by Fowler [5]. His algorithm uses previously computed
knowledge of equivalent subcircuits to find a depth-optimal
ε-approximation to a single-qubit gate, and runs
in exponential time (though much faster than brute-force
search). Our canonical form algorithm does not require
costly uniqueness checks and is relatively parsimonious
in the number of canonical circuits it generates.

Amy et al. [10] describe an algorithm for decomposing
an n-qubit unitary into an exactly equivalent depth-optimal
circuit in time $O(d\,|B|^{d/2})$, where d is the depth
of the circuit and B is the basis. The technique is based
on a meet-in-the-middle algorithm and may be asymptotically
better than Fowler's algorithm when determining
exact sequences. Their approach can also be used
for multi-qubit circuit decomposition. We note that for 
single-qubit circuits, our canonical form algorithm can be 
used to find an exact decomposition, if it exists, in im- 
proved time complexity 0{d \B\ d ^), where B — {H, T}. 

In this paper, we derive a canonical form for single-qubit
unitaries. A similar representation was given by
Matsumoto and Amano [11], who develop a normal form
for {H, T}-circuits, where two circuits in normal form
compute the same unitary matrix if and only if the two
circuits are syntactically identical [12]. The first key
difference between our canonical form and the normal form
in [11] is that their form is expressed in SU(2), which
contains a non-trivial two-element center that makes the
algebra sensitive to the sign of the global phase; in
contrast, our canonical representation of circuits over the
{H, T} basis is developed using group identities in the
projective special unitary group PSU(2). By factoring
out the global phase and working in PSU(2), we are able
to further compress normal circuits. The second key
difference is our concept of canonical circuit, which is a
unique representative of a double coset of circuits with
respect to the Clifford group. It allows further compression
of the depth of a circuit by writing a circuit in the
canonical form $g_1.c.g_2$, where $g_1, g_2$ are Clifford gates and
c is a uniquely defined canonical circuit. Throughout, we
use . to represent circuit composition.
Our primary contributions are: 

1. We present a single-qubit canonical form and corresponding
rules for reducing a single-qubit circuit
into the canonical form (Sec. II).

2. We develop an algorithm for finding an exact,
depth-optimal decomposition of a single-qubit
unitary, if it exists, else a depth-optimal
ε-approximation (Sec. III).

3. We develop an efficient storage database of canonical
circuits and an efficient search procedure over
the database (Sec. IV).

4. We develop an algorithm for finding an ε-approximation
to a single-qubit unitary in polylogarithmic
time (Sec. V).

We begin by describing our canonical form and the cor- 
responding reduction rules. 



II. A CANONICAL FORM AND CANONICAL 
REDUCTION OF CIRCUITS 

We start with PSU(2) representations of the
Hadamard gate H and the π/8-gate T:

$$H = \begin{pmatrix} i/\sqrt{2} & i/\sqrt{2} \\ i/\sqrt{2} & -i/\sqrt{2} \end{pmatrix}, \qquad T = \begin{pmatrix} e^{-i\pi/8} & 0 \\ 0 & e^{i\pi/8} \end{pmatrix}.$$

The Phase gate $S = T^2$ and the Hadamard gate H together
generate a 24-element subgroup of PSU(2), which
is isomorphic to the classical Coxeter group $A_3$, i.e.,
to the symmetric group $S_4$ on four elements. We
denote this group as C.

We introduce the following two circuits, each composed
of two gates, and we call these basic circuits syllables:
TH = T.H, and SH = S.H. In PSU(2), syllable TH
is a group element of infinite order (see Sec. 4.5.3 in
[13]), whereas syllable SH is a group element of order
3: SH.SH.SH = $(SH)^3$ = I.

Consider the set of all circuits generated by various
compositions of TH and SH. We note that the basis
{TH, SH} is an equivalent universal single-qubit basis
to {H, T} since the following identities hold:

$$H = TH\,(SH)^2\,TH; \qquad T = (TH)^2\,(SH)^2\,TH.$$

Throughout, we use {·} to indicate the basis elements of
a group and ⟨·⟩ to indicate the group generated by those
elements.
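The group-theoretic statements above can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes the determinant-one representatives H = (i/√2)[[1, 1], [1, -1]] and T = diag(e^{-iπ/8}, e^{iπ/8}), reads circuit composition "." as a matrix product in written order, and models PSU(2) by identifying a matrix with its negative. The helper names (`mul`, `compose`, `power`, `key`, `generate`) are illustrative.

```python
import cmath
import math
from functools import reduce

s = 1 / math.sqrt(2)
I2 = ((1 + 0j, 0j), (0j, 1 + 0j))
H = ((1j * s, 1j * s), (1j * s, -1j * s))            # assumed SU(2) representative
T = ((cmath.exp(-1j * math.pi / 8), 0j),
     (0j, cmath.exp(1j * math.pi / 8)))              # assumed SU(2) representative

def mul(a, b):
    """Product of two 2x2 complex matrices stored as nested tuples."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )

def compose(*ms):
    """Left-to-right product m1.m2...mk (the empty product is the identity)."""
    return reduce(mul, ms, I2)

def power(m, n):
    return compose(*([m] * n))

def key(m, digits=9):
    """Hashable label of the class {m, -m}, i.e. of m viewed in PSU(2)."""
    flat = (m[0][0], m[0][1], m[1][0], m[1][1])
    sign = 1
    for z in flat:
        parts = [p for p in (z.real, z.imag) if abs(p) > 1e-9]
        if parts:  # the first non-negligible component fixes the sign
            sign = 1 if parts[0] > 0 else -1
            break
    return tuple((round(sign * z.real, digits) + 0.0,
                  round(sign * z.imag, digits) + 0.0) for z in flat)

def generate(gens):
    """Closure of the generators under multiplication, keyed by PSU(2) class."""
    seen = {key(I2): I2}
    frontier = [I2]
    while frontier:
        nxt = []
        for m in frontier:
            for g in gens:
                p = mul(m, g)
                if key(p) not in seen:
                    seen[key(p)] = p
                    nxt.append(p)
        frontier = nxt
    return seen

S = mul(T, T)
TH, SH = mul(T, H), mul(S, H)
C = generate([H, S])

print("size of the group generated by H and S:", len(C))
print("(SH)^3 is the identity:", key(power(SH, 3)) == key(I2))
print("H = TH.(SH)^2.TH:", key(compose(TH, power(SH, 2), TH)) == key(H))
print("T = (TH)^2.(SH)^2.TH:", key(compose(power(TH, 2), power(SH, 2), TH)) == key(T))
```

Rounding inside `key` is only a convenience for hashing floating-point matrices; an exact-arithmetic representation would avoid it at the cost of more code.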

We further note that because SH is a syllable of order
3, any circuit in ⟨TH, SH⟩ can be immediately reduced
to one where each SH-dependent subsequence is either
SH or $(SH)^2$. We also observe that any ⟨TH, SH⟩ circuit
with $(SH)^2$ anywhere in the interior immediately
collapses to an equivalent one with smaller TH count.
After reducing all of the powers of SH to 0, 1, or 2,
any occurrence of $(SH)^2$ in the interior of a circuit has
the TH syllables on both sides and thus is a part of a
$TH\,(SH)^2\,TH$ pattern that collapses to H upon removal
of two TH syllables. Unless this residual H is on the left
end of the reduced circuit, it further cancels with the H
of the preceding TH or SH. Intuitively, $(SH)^2$ should
not occur in a well-formed circuit. In fact, we find that
even single occurrences of SH can be, in a sense, further
squeezed out of the initial sequence of a circuit, leading
to the notion of a canonical form.

Definition. A non-empty circuit in ⟨TH, SH⟩ is said to
be normalized if it ends with TH and does not explicitly
contain $(SH)^2$. A normalized circuit is either the
identity I or a non-empty normalized circuit.

In other words, a normalized circuit is either the identity
I or follows one of the two patterns: c.TH or
c.SH.TH, where c is a shorter normalized circuit.

Definition. A normalized circuit is said to be canonical 
if it does not contain SH earlier than the fifth syllable. 

There are only six canonical circuits with fewer than
six syllables: I, TH, $(TH)^2$, $(TH)^3$, $(TH)^4$, $(TH)^5$. The
shortest canonical circuit that contains the SH syllable
is $(TH)^4$.SH.TH.
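The two definitions translate directly into predicates on syllable sequences. The sketch below is illustrative only: it assumes a circuit is given as a tuple of the strings "TH" and "SH", and the function names are hypothetical. It enumerates short words to reproduce the count of canonical circuits with fewer than six syllables.

```python
from itertools import product

SYLLABLES = ("TH", "SH")

def is_normalized(word):
    """Empty word (identity), or a word ending in TH with no adjacent SH.SH."""
    if any(syl not in SYLLABLES for syl in word):
        raise ValueError("syllables must be 'TH' or 'SH'")
    if not word:
        return True
    no_sh_square = all(pair != ("SH", "SH") for pair in zip(word, word[1:]))
    return word[-1] == "TH" and no_sh_square

def is_canonical(word):
    """Normalized, with no SH among the first four syllables."""
    return is_normalized(word) and "SH" not in word[:4]

def canonical_words(max_len):
    """All canonical words with at most max_len syllables."""
    return [w for n in range(max_len + 1)
            for w in product(SYLLABLES, repeat=n) if is_canonical(w)]

short = canonical_words(5)
with_sh = [w for w in canonical_words(6) if "SH" in w]
print(len(short))   # number of canonical circuits with fewer than six syllables
print(with_sh)      # canonical circuits of at most six syllables that contain SH
```

Brute-force enumeration is exponential in the word length and is meant only as a check of the definitions on small cases.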

Proposition 1. Each ⟨H, T⟩ circuit U can be efficiently
represented as either U = c.g or U = H.c.g, where c is a
normalized circuit and g ∈ C.

Proposition 2. Each ⟨H, T⟩ circuit U can be efficiently
represented as $U = g_1.c.g_2$, where c is a canonical circuit
and $g_1, g_2 \in C$.

Thus the right C-coset of an arbitrary ⟨H, T⟩ circuit U
contains either c or H.c, where c is a normalized circuit
that can be efficiently identified, and the double C-coset
of U contains a canonical circuit that can be efficiently
identified.

We now introduce the T-count cost and the corresponding
trace level:

Definition. The T-count of a normalized circuit is the
number of TH syllables in that circuit.

Definition. A trace level $L_t$ corresponding to a value t,
where $0 \le t \le 2$, is the set

$$L_t = \{U \in PSU(2) : |tr(U)| = t\}.$$

T-count is an invariant of the gate represented by a
canonical circuit, which follows from:

Theorem 1. If $c_1, c_2$ are C-equivalent canonical circuits,
i.e., $\exists g_1, g_2 \in C$ such that $c_2$ and $g_1.c_1.g_2$ evaluate to
the same gate in PSU(2), then $c_1$ and $c_2$ are equal as
⟨TH, SH⟩ circuits.

The proof of Theorem 1 is given in Appendix E.
Note that in our proposed canonical form, the T-count
and the overall circuit depth are closely tied,
e.g., with a {H, T} canonical form there are at least
T-count - 1 and at most T-count + 1 Clifford gates in
the representation, and all but at most two of these gates
are either H or HSH (the number of HSH sequences is
guaranteed to be less than T-count - 3).



III. DEPTH-OPTIMAL CIRCUIT 
DECOMPOSITION 

A natural technique (e.g., Fowler [5]) for finding a
depth-optimal ε-approximation of U is to incrementally
build a database containing unique quantum gates and
their depth-optimal (shortest length) circuit representation,
and then for a given target gate U, perform a proximity
search in the database. Such a database of unique
gates is expensive to build, store, and search.

In contrast, a database of canonical circuits can be
built without recursion and requires less memory for storage,
allowing significantly longer (canonical) circuits to
be maintained in practice. The following remarkable
observation leads to a more efficient algorithm (than
brute-force search and [5]) for finding a depth-optimal
ε-approximation:

Corollary 1. Given a single-qubit gate U ∈ PSU(2),
U can be ε-approximated with an ⟨H, T⟩ circuit with
T-count ≤ t if and only if one of the gates in the double
coset $C.U.C = \{g_1.U.g_2 \mid g_1, g_2 \in C\}$ can be ε-approximated
by a canonical circuit with T-count ≤ t.

It follows that the optimal ε-approximation of U under
a certain T-count t is immediately derived from the
optimal ε-approximation of some gate G ∈ C.U.C under
T-count t.

The search for metric neighbors of target gate U, where 
the measure is trace distance, in a database of all unique 
gates is then replaced by a search for metric neighbors of 
all elements of the C.U.C double coset in the database of canonical
circuits. We note that there are at most 24 × 24 = 576
elements in this coset and all of the searches can be done 
in parallel. The design of a scalable circuit look-up so- 
lution based on the canonical representation is discussed 
in more detail in the next section. 

Fowler has compiled the multiplication table for the
group C generated by H and $S = T^2$ (see Appendix A1
in [5]); here we use the same notation for the group elements.
The H, S representations of these elements can be
found in Appendix A. Effective normalization of circuits
relies on commutation relations between elements of C
and the T gate. There are three types of relations, established
by direct computation in PSU(2) and catalogued
in Appendix B: (1) $g_1.T = T.g_2$, (2) $g_1.T = H.T.g_2$, (3)
$g_1.T = HSH.T.g_2$, where $g_1, g_2 \in C$.
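The three relation types can be reproduced by direct computation. The sketch below is an illustration under stated assumptions, not the catalogue of Appendix B: it assumes determinant-one representatives of H and T, models PSU(2) by identifying a matrix with its negative, and for each g in C solves g.T = p.T.g2 for g2 with p ranging over I, H, and HSH. All helper names are hypothetical.

```python
import cmath
import math

s = 1 / math.sqrt(2)
I2 = ((1 + 0j, 0j), (0j, 1 + 0j))
H = ((1j * s, 1j * s), (1j * s, -1j * s))
T = ((cmath.exp(-1j * math.pi / 8), 0j), (0j, cmath.exp(1j * math.pi / 8)))

def mul(a, b):
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )

def dag(m):
    """Conjugate transpose, which is the inverse of a unitary matrix."""
    return tuple(tuple(m[j][i].conjugate() for j in range(2)) for i in range(2))

def key(m, digits=9):
    """Hashable label of the class {m, -m} in PSU(2)."""
    flat = (m[0][0], m[0][1], m[1][0], m[1][1])
    sign = 1
    for z in flat:
        parts = [p for p in (z.real, z.imag) if abs(p) > 1e-9]
        if parts:
            sign = 1 if parts[0] > 0 else -1
            break
    return tuple((round(sign * z.real, digits) + 0.0,
                  round(sign * z.imag, digits) + 0.0) for z in flat)

def generate(gens):
    seen, frontier = {key(I2): I2}, [I2]
    while frontier:
        nxt = []
        for m in frontier:
            for g in gens:
                p = mul(m, g)
                if key(p) not in seen:
                    seen[key(p)] = p
                    nxt.append(p)
        frontier = nxt
    return seen

S = mul(T, T)
C = generate([H, S])
prefixes = {"T": I2, "H.T": H, "HSH.T": mul(mul(H, S), H)}

def relation_types(g):
    """Prefixes p.T for which g.T = p.T.g2 has a solution g2 in C."""
    found = []
    for name, p in prefixes.items():
        g2 = mul(mul(dag(T), dag(p)), mul(g, T))  # g2 = T^-1 . p^-1 . g . T
        if key(g2) in C:
            found.append(name)
    return found

types = [relation_types(g) for g in C.values()]
counts = {name: sum(t == [name] for t in types) for name in prefixes}
print(counts)
```

In this sketch every element of C falls into exactly one of the three types, which is what allows a T gate to be pushed through Clifford gates deterministically during normalization.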

In order to work constructively with normalized and
canonical circuits, we prove the following propositions:

Proposition 3. The cost of finding a normalized representation
U = c.g or U = H.c.g of an ⟨H, T⟩ circuit U
is linear in the size of the circuit.

The proofs of Propositions 1 and 3 are based on the
normalization algorithm presented in Appendix C.

Proposition 4. The cost of finding a canonical representation
$g_1.c.g_2$ of an ⟨H, T⟩ circuit U is quadratic in
the T-count of its normalization in the worst case.

We prove Propositions 2 and 4 in Appendix D.

The inverse of a non-empty normalized circuit is not a
normalized circuit. However, its special form is described
in the following proposition:

Proposition 5. The normalized representation of the inverse
$c^{-1}$ of a normalized circuit c is either of the form
H.c'.H or of the form $H.c'.H.S^3$, where c' is a normalized
circuit computable in time linear in the depth of c.

Canonical circuits are parsimonious in terms of resource
requirements on a classical computer. There are
$2^{t-3} + 4$ canonical circuits with T-count t or less; for example,
at t = 24 the cardinality is 2,097,156 and the
efficient lookup tree used to experiment with circuits of
this size has a memory footprint of approximately 900
MB. A classical database of canonical circuits can be used
for many practical applications, including algorithms for
performing Solovay-Kitaev decomposition [8]. We describe
the classical database and how to search it efficiently
in Section IV.
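As a quick arithmetic check, the closed form quoted above can be evaluated for the database sizes mentioned in this paper. This is only a sketch: the function name is hypothetical, the restriction to t ≥ 3 is an assumption made to keep the arithmetic integral (the text does not state the range of validity), and the per-circuit figure assumes that "MB" means 10^6 bytes.

```python
def canonical_count(t):
    """Closed form quoted in the text: canonical circuits with T-count <= t."""
    if t < 3:
        # Assumption: restrict to t >= 3 so that 2**(t - 3) is an integer.
        raise ValueError("this sketch only evaluates the formula for t >= 3")
    return 2 ** (t - 3) + 4

for t in (24, 25, 31):
    print(t, canonical_count(t))

# Rough storage per circuit implied by the quoted 900 MB footprint at t = 24.
bytes_per_circuit = 900e6 / canonical_count(24)
print(round(bytes_per_circuit))
```

The values at t = 25 and t = 31 can be compared with the cardinalities reported in the experimental section.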



IV. SEARCH FOR CANONICAL 
APPROXIMATIONS 

Let $B = \{b_1, b_2, \ldots, b_k\} \subset PSU(2)$. We say that B is a
basis with Clifford reduction if there is a proper subset CC
(to represent "Canonical Circuits") of the subgroup ⟨B⟩
of all of the circuits in basis B and a computable mapping
$Cr : \langle B \rangle \to CC$ where $\forall U \in \langle B \rangle$, $\exists g_1, g_2 \in C$ such that
$U = g_1.Cr(U).g_2$.

We also assume that there is a partial function

$$cost : \langle B \rangle \to \mathbb{Z}^+$$

that is (1) well-defined on CC; (2) zero on C; and
(3) subadditive w.r.t. composition, i.e., $cost(U_1.U_2) \le cost(U_1) + cost(U_2)$
(whenever both the left-hand side
and the right-hand side are well-defined). We may additionally
assume that the cost function is strictly additive
on CC.

Our findings below apply to any such basis, even
though the implicit focus of this section is on the {H, T}
basis with the T-count as the target cost function. Consider
the ε-approximation of a target gate U ∈ PSU(2)
to precision ε > 0. Given a classical database of some
circuits in the basis B, the database query of primary
interest is to find the minimum cost ε-approximation of
U:

Query 1. Find $\arg\min_{V \in \langle B \rangle} \{cost(V) \mid dist(V, U) < \epsilon\}$.

Suppose now that we only have a database of some circuits
in the subset CC. The hypothetical approximating
circuit V can be represented as $h_1.Cr(V).h_2$, $h_1, h_2 \in C$.
By the assumed properties of the cost function,
$cost(h_1.Cr(V).h_2) \le cost(Cr(V))$. We also have that
$dist(h_1.Cr(V).h_2, U) = dist(Cr(V), h_1^{-1}.U.h_2^{-1})$.

We can now rewrite the query as

Query 2. Find

$$\arg\min_{g_1, g_2 \in C,\; c \in CC} \{cost(c) \mid dist(c, g_1.U.g_2) < \epsilon\}.$$

Consider the adjoint action of C on PSU(2):

$$Ad_g[U] = g.U.g^{-1}, \quad g \in C,\; U \in PSU(2).$$

Since $g_1.U.g_2 = g_1.U.(g_2.g_1).g_1^{-1} = Ad_{g_1}[U.(g_2.g_1)]$, the
query can again be rewritten as:

Query 3. Find

$$\arg\min_{g, h \in C,\; c \in CC} \{cost(c) \mid dist(c, Ad_g[U.h]) < \epsilon\},$$

which is equivalent to

Query 4. Find

$$\arg\min_{h \in C} \Big( \min_{g \in C,\; c \in CC} \{cost(c) \mid dist(c, Ad_g[U.h]) < \epsilon\} \Big).$$

The final query above is scalable because the adjoint
action $Ad_g$ preserves the absolute matrix trace, whereas
the right action $U \to U.h$ tends to change the absolute
matrix trace (for non-trivial elements of C). Thus the set
$\{U.h \mid h \in C\}$ tends to be distributed across several (up to
|C| = 24) trace levels.
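The contrast between the adjoint action and the right action can be observed on a concrete example. The sketch below is illustrative only: the target U is an arbitrary SU(2) matrix built from three hypothetical angles, H and T are assumed determinant-one representatives, and the helper names are not taken from the paper.

```python
import cmath
import math

s = 1 / math.sqrt(2)
I2 = ((1 + 0j, 0j), (0j, 1 + 0j))
H = ((1j * s, 1j * s), (1j * s, -1j * s))
T = ((cmath.exp(-1j * math.pi / 8), 0j), (0j, cmath.exp(1j * math.pi / 8)))

def mul(a, b):
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )

def dag(m):
    return tuple(tuple(m[j][i].conjugate() for j in range(2)) for i in range(2))

def key(m, digits=9):
    """Hashable label of the class {m, -m} in PSU(2)."""
    flat = (m[0][0], m[0][1], m[1][0], m[1][1])
    sign = 1
    for z in flat:
        parts = [p for p in (z.real, z.imag) if abs(p) > 1e-9]
        if parts:
            sign = 1 if parts[0] > 0 else -1
            break
    return tuple((round(sign * z.real, digits) + 0.0,
                  round(sign * z.imag, digits) + 0.0) for z in flat)

def generate(gens):
    seen, frontier = {key(I2): I2}, [I2]
    while frontier:
        nxt = []
        for m in frontier:
            for g in gens:
                p = mul(m, g)
                if key(p) not in seen:
                    seen[key(p)] = p
                    nxt.append(p)
        frontier = nxt
    return seen

def abs_trace(m):
    return abs(m[0][0] + m[1][1])

def su2(a, b, c):
    """An SU(2) matrix from three angles (illustrative parametrization)."""
    return ((math.cos(a) * cmath.exp(1j * b), -math.sin(a) * cmath.exp(-1j * c)),
            (math.sin(a) * cmath.exp(1j * c), math.cos(a) * cmath.exp(-1j * b)))

C = generate([H, mul(T, T)])
U = su2(0.3, 1.1, -0.7)  # arbitrary example target

# Right action: the absolute trace of U.h generally depends on h.
levels = sorted({round(abs_trace(mul(U, h)), 9) for h in C.values()})

# Adjoint action: |tr(g.(U.h).g^-1)| equals |tr(U.h)| for every g.
drift = max(abs(abs_trace(mul(mul(g, mul(U, h)), dag(g))) - abs_trace(mul(U, h)))
            for g in C.values() for h in C.values())

print(len(levels), "distinct trace levels among the U.h")
print("largest change of |tr| under the adjoint action:", drift)
```

Because the adjoint action never leaves a trace level, a trace-indexed database only needs to be consulted at the levels hit by the right action.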

We use the absolute matrix trace as the primary key
in our database of CC circuits. We also assume that the
proximity of two circuits implies the proximity of their
absolute matrix trace values. This is obviously true when
the distance measure is given by

$$dist(U, V) = \sqrt{(2 - |tr(U V^\dagger)|)/2},$$

where dist(U, V) < ε implies that $\big|\,|tr(U)| - |tr(V)|\,\big| < 4\epsilon$.
Throughout, we assume this distance measure, although
other distance measures are possible.
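The distance measure and the trace bound can be exercised on a small example. This is a minimal sketch with hypothetical helper names; the matrices are arbitrary SU(2) elements chosen for illustration, with V a small perturbation of U.

```python
import cmath
import math

def mul(a, b):
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )

def dag(m):
    """Conjugate transpose of a 2x2 matrix."""
    return tuple(tuple(m[j][i].conjugate() for j in range(2)) for i in range(2))

def abs_trace(m):
    return abs(m[0][0] + m[1][1])

def dist(u, v):
    """dist(U, V) = sqrt((2 - |tr(U V^dagger)|) / 2), clamped against rounding."""
    return math.sqrt(max(0.0, (2.0 - abs_trace(mul(u, dag(v)))) / 2.0))

def su2(a, b, c):
    """An SU(2) matrix from three angles (illustrative parametrization)."""
    return ((math.cos(a) * cmath.exp(1j * b), -math.sin(a) * cmath.exp(-1j * c)),
            (math.sin(a) * cmath.exp(1j * c), math.cos(a) * cmath.exp(-1j * b)))

U = su2(0.3, 1.1, -0.7)
V = mul(U, su2(0.01, 0.02, 0.0))                    # a nearby gate
minus_U = tuple(tuple(-z for z in row) for row in U)

d = dist(U, V)
gap = abs(abs_trace(U) - abs_trace(V))
print("dist(U, V) =", d)
print("| |tr U| - |tr V| | =", gap, "bound 4*dist =", 4 * d)
print("dist(U, -U) =", dist(U, minus_U))            # global sign is ignored
```

The measure depends only on the absolute trace of $U V^\dagger$, so it is insensitive to the global sign and is well defined on PSU(2).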



Now consider the list of distinct absolute trace values
$\{t_1, \ldots, t_r\} = \bigcup \{|tr(U.h)| : h \in C\}$ appearing in Query
4. When ε is small enough, the individual approximation
targets $Ad_g[U.h]$, h ∈ C, are distributed across non-intersecting
neighborhoods $\{U : \big|\,|tr(U)| - t_i\big| < \delta\}$, for
i = 1, ..., r and some suitable δ > 0.

Thus given that the database of the CC circuits is distributed
across logical computational nodes indexed by
the absolute trace values, we have a good mapping of
approximation target cases $Ad_g[U.h]$, h ∈ C, across r non-intersecting
logical computational node groups.

Before describing ways of further partitioning the
search space, we make the following empirical observations:

1. Canonical circuits with T-count ≤ k have only
$O(2^{k/2})$ distinct absolute trace values (empirical estimate:
< $6 \times 2^{k/2}$ trace values).

2. Each trace level $L_t$ has either zero or at most
$O(2^{k/2})$ canonical circuits with the T-count k.
(Whenever Conjecture 1 of Sec. VII holds, the T-count
is constant on trace level $L_t$.)

3. The complexity of a search for the ε-approximation
in the database of all canonical circuits with T-count
≤ k is $O(\epsilon k 2^k)$ when the desired approximation
exists; the non-existence of the approximation
is discovered in O(k) steps on average and in
$O(k 2^{k/2})$ steps in the worst case.

Now we explore the geometry of an individual trace
level $L_t = \{V : |tr(V)| = t\}$. Except for the extreme values
t = 0 and t = 2, this trace level has the geometry
of a 2-dimensional Euclidean sphere with the adjoint action
of C faithful and isomorphic to the action of
the group of symmetries of the octahedron with vertices
(±1, 0, 0), (0, ±1, 0), (0, 0, ±1). The trace level $L_t$, viewed
as the Euclidean sphere, can be covered with 24 fundamental
tiles of this action. For instance, we can select
the spherical triangle $F_0$ with vertices at x = y = 0,
y = z = 0, and x = y = z, with x > 0, z > 0, and generate all
tiles as $Ad_g[F_0]$, g ∈ C. Now, consider an arbitrary fixed
h ∈ C and the trace level $\{|tr(V)| = |tr(U.h)|\}$ viewed as
the tiled sphere with the C tiling introduced above. For
the majority of matrices U.h, the individual approximation
targets $Ad_g[U.h]$, g ∈ C, are distributed across different
fundamental tiles.

Based on these considerations we add a collection of
secondary indices to the database of the CC circuits where
the secondary keys are provided by the geometry described
above. Given that 0 < t < 2 is the value of the
absolute matrix trace of certain circuits from CC, each
fundamental tile $F_i$ of the trace level $L_t$ has a face index
associated with it that lists all circuits found in the
interior of $F_i$. Additionally, each pair of adjacent tiles
has an edge index associated with it that lists all circuits
for which their common boundary is the closest
such boundary. Finally, we note 14 special points called
vertices on the trace level $L_t$ that are meeting points of
more than two tiles (see Figure 1). Each vertex v has a
vertex index $V_i$ associated with it that lists all circuits in
$L_t$ for which v is the closest vertex.

FIG. 1. Trace level with 5 (out of 24) tiles and 6 (out of 14)
vertices showing. $E_i$, $F_i$, $V_i$ indicate edge, face, and vertex
indices, respectively.

Consider the target U.h of the subquery of Query 4:

$$\min_{g \in C,\; c \in CC} \{cost(c) \mid dist(c, Ad_g[U.h]) < \epsilon\}.$$

Let 0 < t < 2 be such that $\big|\,|tr(U.h)| - t\big| < 4\epsilon$ and trace
level $L_t$ contains some circuits from CC. For the majority
of matrices U, the projection of U.h on the trace
level $L_t$ is, with high probability, far enough from the boundaries
of the fundamental tile F to the interior of which
that projection belongs. Therefore in order to find the
$\min_{c \in CC} \{cost(c) \mid dist(c, U.h) < \epsilon\}$ in this case it suffices
to inspect the face index of that tile. For a non-trivial
g ∈ C the situation is isometric, so the search for
$\min_{c \in CC} \{cost(c) \mid dist(c, Ad_g[U.h]) < \epsilon\}$ can be limited to
the interior of the $Ad_g[F]$ tile.

Of course, with lower probability, U.h will fall within
ε of some edge or vertex of the trace level $L_t$, which
requires the use of multiple tile, edge, or vertex indices. In
practice the above subquery should be distributed over
all relevant secondary indices. With high probability,
most of the indices will be immediately eliminated based
on the trace-level geometry.



V. APPLICATION TO SOLOVAY-KITAEV 
DECOMPOSITION 

In this section, we use our canonical representations for
Solovay-Kitaev decomposition. Recall that the Dawson-Nielsen
(D-N) algorithm for the Solovay-Kitaev theorem
[5] is recursive, and finer approximations require greater
recursion depth. At depth level 0, D-N returns an extrinsic
"basic" approximation of a requested single-qubit gate
U. At depth n, it composes an approximation from the
depth n-1 approximation $U_{n-1}$ and the depth n-1 approximations
of two auxiliary matrices $V_{n-1}$ and $W_{n-1}$,
such that the resulting approximation is given by

$$U_n = V_{n-1}.W_{n-1}.V_{n-1}^{-1}.W_{n-1}^{-1}.U_{n-1}. \qquad (1)$$



We want to maintain the canonical form for each of the
approximating circuits at each depth level, starting with
base level n = 0. We can efficiently look up the 0-level
approximations by using our design for efficient parallel
lookup over a large database of canonical circuits (see
Section IV). This results in an interesting tradeoff. When
all 0-level approximations are sought in a database of
canonical circuits with T-count ≤ t, where t is relatively
large, in the worst case the D-N n-level recursion may
result in a circuit with T-count cost $O(t\,5^n)$, seemingly
worsening the T-count vs. precision performance curve
for the algorithm.
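The worst-case growth follows from counting the factors in Eq. (1): each level composes five approximations from the previous level. The sketch below only illustrates this counting; the function names are hypothetical, and the bound ignores any cancelation obtained by canonical reduction.

```python
def base_circuit_count(n):
    """Number of 0-level circuits composed at recursion depth n.

    Eq. (1) uses five depth-(n-1) factors: V, W, their inverses, and U.
    """
    return 1 if n == 0 else 5 * base_circuit_count(n - 1)

def worst_case_t_count(t, n):
    """Upper bound on the T-count when every 0-level circuit has T-count <= t."""
    return t * base_circuit_count(n)

for n in range(4):
    print(n, base_circuit_count(n), worst_case_t_count(25, n))
```

With a 0-level database limited to T-count 25, for instance, the bound at recursion depth 3 is 25 × 125 = 3125 before any reduction is applied.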

On the other hand, improving the quality of the 0-level
approximation may in fact decrease the required
recursion depth and exponentially decrease the circuit's
T-count. For example, increasing the 0-level database
scope from T-count ≤ 12 to T-count ≤ 28 improves the
precision of the 0-level approximation by a factor of 9.8 on
average. According to the D-N estimate (Sec. 3, Eq. 1 in
[5]), this results in an improvement in precision by a coefficient
around $10^{-6}$ at depth 4 and around $10^{-9}$ at depth
6. Thus if we have an ε-approximation using a database
containing circuits with T-count ≤ 12, then we can expect
to have a significantly more precise ε-approximation
by expanding the database to include circuits with T-counts
in the high 20s.

In practice, we find that our technique scales even better
than the D-N estimate suggests. With a database
of 0-level approximations up to T-count = 25 or 26, we
are limited as early as recursion depth 4 only by the accuracy
of the machine-defined double type. Therefore,
our experimental results only cover recursion depths ≤ 3
[14]. In terms of circuit cost, we barely exceed a T-count
of 3000 for the longest of our circuit approximations,
whereas previous approaches cite T-counts of $10^5$
or more.

The impact of the canonical reduction on the quality
of the D-N commutant formula (Eq. 1) is profound. Consider
first the composition of a canonical presentation
with a normalized presentation (in this order). Without
loss of generality, we can consider composition in the
form $U = (g_1.V.TH.g_2).([H.].W.g_3)$, where $g_1, g_2, g_3 \in C$,
W is normalized, and V.TH is canonical. The [·] indicates
that the sequence is present in one case and absent
in the other. We are especially interested in cases
where cancelation occurs, namely the resulting composition
has T-count smaller than the sum of the T-counts
of V.TH and W. Cancelation is triggered by a certain
structure of the normalization of the $(H.g_2.[H.].W)$ circuit
that is of the form $W = [H][SH]W_1.g_4$, where
$g_4 \in C$ and the normalized circuit $W_1$ is either empty
or starts and ends with TH. By Lemma 1, the trailing
T in $g_1.V.T$ will not cancel when W starts with H or
SH, or when $W_1$ is empty. Consider the remaining case:
$W = TH.W_2.g_4$. Here, $U = g_1.V.SH.W_2.g_4$, implying
that T-count(U) < T-count(V) + T-count(W).

Further transformations are necessary when $V = V_2.SH$.
If $W_2$ starts with SH, i.e., $W_2 = SH.W_3$, then
$U = g_1.V_2.W_3.g_4$ is a normalized form and no further
reduction in T-count is possible. However, if $W_2$ starts
with TH we get the infamous $TH.(SH)^2.TH$ pattern,
which reduces to H, which is likely to cascade into further
cancelations.

To summarize, normalized composition of circuits reduces
the T-count of the resulting circuit in many cases.
An additional benefit is that by using canonical reduction,
we can restrict the number of Clifford gates as well.
Each interior gate in a normalized circuit is either H or
HSH (and if the circuit is canonical then the number of
HSH gates cannot be greater than T-count - 5).

Given an ε-approximation circuit c of a target gate U,
obtained for example by using D-N, the normalized form of
circuit c, denoted by n(c), is a minimal cost circuit that is
exactly equivalent to c; however, normalization does not
guarantee that the result is a lowest cost ε-approximation
of U. Indeed, there are potentially many normalized circuits
in the ε-neighborhood of U, including some with
T-counts lower than the T-count of n(c), that are simply
not obtainable by a specific method (e.g., the D-N
algorithm for Solovay-Kitaev).



VI. EXPERIMENTAL RESULTS 

We evaluate the performance of our canonical form
and reduction techniques in two experimental scenarios.
In each case, we evaluate the performance of decomposing
10,000 randomly generated, single-qubit unitaries
into their ε-approximations. First, we study the
tradeoffs between T-count cost and precision ε for the
0-level ε-approximation, employing our canonical circuit
database. Second, we study the same tradeoffs for the n-level
ε-approximation, where n ≤ 3, using our database,
canonical reduction, and the recursive Solovay-Kitaev algorithm
[8].

To evaluate our findings, we generated and catalogued
each of the 268,435,460 canonical circuits with
T-count ≤ 31. Our database of canonical circuits has
the absolute matrix trace as its primary index, and has
secondary indices based on the fundamental tiles of the
adjoint representation of the C group (see Sec. IV).

Our experiments and database required a memory
footprint of 120GB and the use of a high-performance
multi-core workstation. We discovered, however, that
canonical circuits with T-count > 25 did not offer significant
improvements in T-count/precision ε tradeoffs in
the second experimental scenario using machine double
accuracy. In practice, a database of canonical circuits of
T-count ≤ 25, which has cardinality 4,194,308 and RAM
footprint ≈2GB, is sufficient. In all cases, extensive multithreading
is required when high query throughput is
sought.

FIG. 2. T-count versus mean precision ε (trace distance) over
the ε-approximations at 0-level for 10,000 random unitaries.

We compare the performance of our depth-optimal
0-level ε-approximation invoking our canonical circuit
database with the state-of-the-art, depth-optimal baseline
technique of Fowler [5]. Figure 2 shows the T-count
versus the precision ε for our canonical form technique
(search in our database) and for Fowler's technique,
where Fowler uses a database of unique ⟨H, T⟩ gates.
Both curves are obtained by calculating the mean precision
ε for a given T-count for the ε-approximations of
10,000 random unitary gates.

Since both techniques are depth-optimal, we expect the
curves to align, and hope to find that our database can
store much longer sequences than previous techniques.
The curves are nearly identical for T-counts between
15 and 22. The slight divergence below T-count 15 is
likely due to the fact that Fowler's technique optimizes
for overall gate count (circuit length), whereas we optimize
for T-count. Fowler's method could however be
adapted to minimize T-count, in which case the curves
would be identical up to T-count 22. The key observation
is that reduction to canonical circuits enables a much
larger database, extending beyond a T-count of 30 (without the
use of overly extravagant hardware), whereas previous
state-of-the-art techniques obtain less compression, and
in turn require more memory, limiting the database to
T-count 22 [15].

We next study canonical forms within Solovay-Kitaev
decomposition. We compare the use of our canonical reduction
within the Dawson-Nielsen algorithm to the original
Dawson-Nielsen algorithm [5]. Figure 3 compares
three implementations of our canonical technique to D-N.
The canonical implementations use canonical reduction,
as well as three different canonical circuit database
sizes, 1GB, 2GB, and 4GB, each enabling storage of circuits
with up to T-count 24, 25, and 26, respectively.
Each curve represents the mean precision ε for a given
T-count for the ε-approximations of 10,000 random unitary
gates for recursion levels n = 0, 1, 2, 3. Both axes in
the graph are plotted on the logarithmic scale.

FIG. 3. T-count versus mean precision ε (trace distance) over
the ε-approximations at n-level recursion for 10,000 random
unitaries and n = 0, 1, 2, 3, where the markers indicate the
recursion level n.

First, we note that there is no visible difference between
the 2GB canonical implementation and the 4GB canonical
implementation. Second, we observe that our technique, for
all three implementations, is able to find, for a given ε,
approximations with significantly smaller T-count. In
particular, at T-counts below 500, our methods achieve
$\epsilon = 5 \times 10^{-8}$, offering a factor of $10^{-6}$
improvement over D-N. Improving the precision of our
technique further would require computing the matrix trace
with precision beyond the limit of machine double precision.
At the best D-N precision of $\epsilon = 5 \times 10^{-5}$, D-N
requires roughly 100,000 T gates on average, while our
2GB implementation (SK+2G) requires only 120 T gates
on average (a factor of 846 improvement).
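The double-precision limit mentioned above can be illustrated with a minimal sketch. It assumes the distance is computed as sqrt((2 - |tr(U V^dagger)|)/2), a common trace-based metric that is taken here as an assumption rather than quoted from this section; the helper names `trace_distance`, `rz`, and `exact_distance` are illustrative only. Because the trace is subtracted from 2, differences below roughly 1e-16 are lost, so distances much smaller than about 1e-8 cannot be resolved in double precision.

```python
import numpy as np

def trace_distance(u, v):
    # Assumed metric: sqrt((2 - |tr(u v^dagger)|) / 2), insensitive to global phase.
    overlap = abs(np.trace(u @ v.conj().T))
    return float(np.sqrt(max(0.0, 2.0 - overlap) / 2.0))

def rz(theta):
    # Rotation about the Z axis by angle theta.
    return np.diag([np.exp(-0.5j * theta), np.exp(0.5j * theta)])

def exact_distance(theta):
    # Closed form of trace_distance(rz(theta), identity): sqrt(2) * |sin(theta / 4)|.
    return float(np.sqrt(2.0) * abs(np.sin(theta / 4.0)))

identity = np.eye(2, dtype=complex)
for theta in (1e-3, 1e-6, 1e-9):
    computed = trace_distance(rz(theta), identity)
    print(f"theta={theta:.0e}  computed={computed:.3e}  exact={exact_distance(theta):.3e}")
```

For the smallest angle the computed value no longer tracks the closed form, which is the cancellation effect the text refers to.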



VII. CONCLUSIONS AND FUTURE WORK 

We have defined a depth-optimal canonical form and
corresponding reduction rules for single-qubit quantum
circuits. Our techniques result in significant improvements
in terms of database size and achieved precision in the
case of the depth-optimal 0-level ε-approximation, and
significant improvements in the T-count/precision ε curve
when applied to Solovay-Kitaev decomposition for n levels
of recursion. A natural future direction is to generalize
the definition of a canonical form to multi-qubit gates as
well as to other universal bases.

Another direction is to perform "lossy compression",
where the task is to find an approximately equivalent
circuit (within distance ε of the target gate) that has
lower cost in terms of a given cost function, such as
T-count or number of gates. We believe such a solution
will require the following conjecture:

Conjecture 1. If $c_1, c_2$ are canonical circuits and
T-count($c_1$) $\neq$ T-count($c_2$), then $|tr(c_1)| \neq |tr(c_2)|$.

This conjecture implies that if a trace level
$L_t = \{U \in PSU(2) \mid |tr(U)| = t\}$ contains multiple
canonical circuits, all of these circuits have the same
T-count. We currently have only empirical brute-force
evidence of Conjecture 1 for T-count < 31.
Appendix A: Elements of the C group

The following definitions are equivalent to the ones
given in Appendix A1 of [5].

$G_0 = Id$; $G_1 = H$; $G_2 = HSSH$; $G_3 = SS$; $G_4 = S$;
$G_5 = SSS$; $G_6 = HSS$; $G_7 = SSH$; $G_8 = SH$;
$G_9 = SSSH$; $G_{10} = SSHSSH$; $G_{11} = SHSSH$;
$G_{12} = SSSHSSH$; $G_{13} = HS$; $G_{14} = HSSS$;
$G_{15} = SSHSS$; $G_{16} = SHSS$; $G_{17} = SSSHSS$;
$G_{18} = HSH$; $G_{19} = HSSSH$; $G_{20} = HSHSSH$;
$G_{21} = HSSSHSSH$; $G_{22} = SSSHS$; $G_{23} = SHSSS$
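As a sanity check on this list, the following sketch builds each word from H and S and confirms that the 24 words give 24 different operations up to global phase, and that the set is closed under multiplication. It identifies each element by its action on the Pauli matrices; the names `word_matrix` and `signature` are illustrative, not taken from the paper.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.diag([1.0 + 0j, -1.0])
H = (X + Z) / np.sqrt(2)
S = np.diag([1, 1j])

# WORDS[i] is the H/S word of G_i; the empty word is the identity G_0.
WORDS = ["", "H", "HSSH", "SS", "S", "SSS", "HSS", "SSH", "SH", "SSSH",
         "SSHSSH", "SHSSH", "SSSHSSH", "HS", "HSSS", "SSHSS", "SHSS",
         "SSSHSS", "HSH", "HSSSH", "HSHSSH", "HSSSHSSH", "SSSHS", "SHSSS"]

def word_matrix(word):
    # Multiply the gates of a word from left to right.
    m = np.eye(2, dtype=complex)
    for ch in word:
        m = m @ (H if ch == "H" else S)
    return m

def signature(u):
    # Row for P in (X, Y, Z): integer coefficients of u P u^dagger in the Pauli basis.
    paulis = (X, Y, Z)
    return tuple(
        tuple(int(round((np.trace(q @ u @ p @ u.conj().T) / 2).real)) for q in paulis)
        for p in paulis
    )

signatures = [signature(word_matrix(w)) for w in WORDS]
known = set(signatures)
distinct = len(known)
closed = all(signature(word_matrix(a + b)) in known for a in WORDS for b in WORDS)
print(distinct, closed)
```

The signature is insensitive to global phase, which matches working in PSU(2).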



Appendix B: C/T commutation relations 



$G_1.T = H.T$; $G_2.T = T.G_{12}$; $G_3.T = T.G_3$;
$G_4.T = T.G_4$; $G_5.T = T.G_5$; $G_6.T = H.T.G_3$;
$G_7.T = H.T.G_{12}$; $G_8.T = H.SH.T.G_2$;
$G_9.T = H.SH.T.G_4$; $G_{10}.T = T.G_{11}$;
$G_{11}.T = T.G_2$; $G_{12}.T = T.G_{10}$;
$G_{13}.T = H.T.G_4$; $G_{14}.T = H.T.G_5$;
$G_{15}.T = H.T.G_{11}$; $G_{16}.T = H.SH.T.G_{10}$;
$G_{17}.T = H.SH.T.G_5$; $G_{18}.T = H.SH.T$;
$G_{19}.T = H.SH.T.G_{12}$; $G_{20}.T = H.T.G_2$;
$G_{21}.T = H.T.G_{10}$; $G_{22}.T = H.SH.T.G_3$;
$G_{23}.T = H.SH.T.G_{11}$
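The relations can be checked numerically. The sketch below encodes each relation $G_i.T = P.T.G_j$ as a pair (P, j), where P is the empty word, H, or HSH, and compares both sides up to a global phase. The table encoding and the helper names (`TABLE`, `word_matrix`, `same`) are assumptions made for this illustration.

```python
import numpy as np

H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
S = np.diag([1, 1j])
T = np.diag([1, np.exp(0.25j * np.pi)])

WORDS = ["", "H", "HSSH", "SS", "S", "SSS", "HSS", "SSH", "SH", "SSSH",
         "SSHSSH", "SHSSH", "SSSHSSH", "HS", "HSSS", "SSHSS", "SHSS",
         "SSSHSS", "HSH", "HSSSH", "HSHSSH", "HSSSHSSH", "SSSHS", "SHSSS"]

# TABLE[i] = (prefix word, j) encodes G_i.T = prefix.T.G_j (j = 0 means no trailing factor).
TABLE = {
    1: ("H", 0), 2: ("", 12), 3: ("", 3), 4: ("", 4), 5: ("", 5),
    6: ("H", 3), 7: ("H", 12), 8: ("HSH", 2), 9: ("HSH", 4), 10: ("", 11),
    11: ("", 2), 12: ("", 10), 13: ("H", 4), 14: ("H", 5), 15: ("H", 11),
    16: ("HSH", 10), 17: ("HSH", 5), 18: ("HSH", 0), 19: ("HSH", 12),
    20: ("H", 2), 21: ("H", 10), 22: ("HSH", 3), 23: ("HSH", 11),
}

def word_matrix(word):
    # Multiply the gates of an H/S word from left to right.
    m = np.eye(2, dtype=complex)
    for ch in word:
        m = m @ (H if ch == "H" else S)
    return m

def same(a, b):
    # Two 2x2 unitaries agree up to global phase iff |tr(a^dagger b)| = 2.
    return abs(abs(np.trace(a.conj().T @ b)) - 2.0) < 1e-9

failures = [
    i for i, (prefix, j) in TABLE.items()
    if not same(word_matrix(WORDS[i]) @ T,
                word_matrix(prefix) @ T @ word_matrix(WORDS[j]))
]
print("relations that fail:", failures)
```

An empty failure list means every listed relation holds in PSU(2).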



Appendix C: Proof of Propositions 1 and 3 

Since $T^2 = S \in C$, any $\langle H, T \rangle$ circuit has the form
$U = \left(\prod_{i=1}^{k} g_i.T\right).g$, $k \geq 0$, where
$g, g_i \in C$ and $g_i \neq Id$ when $i > 1$.

Collect all factors in this product (in the order they
appear) into a gateList. The following algorithm is
tail-recursive, and the C group is denoted by C:

Algorithm

CircuitNormalize(input: gateList): gateList =
    // H[SH] denotes either H or H.SH
    if input is empty then
        return empty list
    let left <- {head(input)}
    let right <- tail(input)
    while (left is not empty) && (right is not empty) do
        if head(right) = T then
            if head(left) = T then
                left <- tail(left)
                right <- {G4} + tail(right)    // G4 = S = T.T
            else    // head(left) in C
                if head(left) = H[SH] then
                    left <- {T} + left
                    right <- tail(right)
                else
                    // see Appendix B
                    let cmt <- apply C/T commutation table
                               to head(left) and T
                    left <- tail(left)
                    if (cmt = H[SH].T.g, g in C) then
                        left <- {g, T, H[SH]} + left
                        right <- tail(right)
                    else if (cmt = T.g, g in C) then
                        // replace the consumed T by T.g
                        right <- {T, g} + tail(right)
        else
            if head(left) = T then
                left <- {head(right)} + left
            else    // head(left), head(right) in C
                let g <- C product of head(left) and head(right)
                left <- tail(left)
                if g <> Id then
                    left <- {g} + left
            right <- tail(right)
    if left is empty then
        return CircuitNormalize(right)
    else
        return reverse(left) + right

The intent of this algorithm is to eliminate all of the
Clifford gates that are different from either H or HSH
from the interior of the gateList. The cost of each such
elimination is bounded by a constant. Thus the cost of the
algorithm is linear in the number of such Clifford gates
and hence linear in the length of the input circuit.
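The procedure can be prototyped along the following lines. This is a minimal sketch, not the authors' implementation: a gate list is assumed to be a Python list whose entries are the string "T" or an integer i standing for the Clifford element $G_i$ of Appendix A, the commutation step is found by search rather than read from a stored table, and the tail call is replaced by a loop. All function names are illustrative.

```python
import random
import numpy as np

H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
S = np.diag([1, 1j])
T = np.diag([1, np.exp(0.25j * np.pi)])
WORDS = ["", "H", "HSSH", "SS", "S", "SSS", "HSS", "SSH", "SH", "SSSH",
         "SSHSSH", "SHSSH", "SSSHSSH", "HS", "HSSS", "SSHSS", "SHSS",
         "SSSHSS", "HSH", "HSSSH", "HSHSSH", "HSSSHSSH", "SSSHS", "SHSSS"]

def word_matrix(word):
    m = np.eye(2, dtype=complex)
    for ch in word:
        m = m @ (H if ch == "H" else S)
    return m

C = [word_matrix(w) for w in WORDS]          # C[i] is the matrix of G_i

def same(a, b):
    # Equality of 2x2 unitaries up to a global phase.
    return abs(abs(np.trace(a.conj().T @ b)) - 2.0) < 1e-9

def index(m):
    # Index i of the Clifford element G_i equal to m up to phase.
    return next(i for i, g in enumerate(C) if same(g, m))

def commute(i):
    # Find (p, g) with G_i.T = G_p.T.G_g and G_p in {Id, H, HSH}.
    for p in (0, 1, 18):
        for g in range(24):
            if same(C[i] @ T, C[p] @ T @ C[g]):
                return p, g

def normalize(gates):
    gates = [x for x in gates if x != 0]     # drop identity factors
    while gates:                             # loop replaces the tail call
        left, right = [gates[0]], gates[:0:-1]   # stacks: left[-1], right[-1] are the heads
        while left and right:
            if right[-1] == "T":
                right.pop()
                if left[-1] == "T":
                    left.pop()
                    right.append(4)          # T.T = S = G4
                elif left[-1] in (1, 18):    # H or HSH stays in place
                    left.append("T")
                else:
                    p, g = commute(left.pop())
                    if p == 0:               # G_i.T = T.g
                        right += [g, "T"]
                    else:                    # G_i.T = H[SH].T.g
                        left += [p, "T", g]
            else:
                c = right.pop()
                if left[-1] == "T":
                    left.append(c)
                else:                        # merge two adjacent Clifford gates
                    g = index(C[left.pop()] @ C[c])
                    if g != 0:
                        left.append(g)
        if left:
            return left                      # right is exhausted here
        gates = right[::-1]
    return []

def evaluate(gates):
    m = np.eye(2, dtype=complex)
    for x in gates:
        m = m @ (T if x == "T" else C[x])
    return m

rng = random.Random(0)
circuit = ["T" if rng.random() < 0.5 else rng.randrange(24) for _ in range(30)]
print(normalize(circuit), same(evaluate(circuit), evaluate(normalize(circuit))))
```

For example, `normalize([1, "T", "T", 1])` collapses H.T.T.H to the single Clifford element `[18]`, that is, HSH.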



Appendix D: Proof of Propositions 2 and 4 

Lemma 1. A normalized circuit of the form $U = SHTH.c$
(where c is a normalized subcircuit) can be effectively
rewritten as a normalized representation $H.SHTH.c_1.g$,
$g \in C$, with the number of rewrites linear in the T-count
of c. The resulting circuit $c_1$ has the same T-count as c.

Proof. By brute force, we establish that SHTH =
HSHT.HSS and "upset" the normalization to start with
HSHT.HSS.c. The rest of the proof is similar to the
proof of Propositions 1 and 3, i.e., we establish by linear
induction that HSS.c reduces to $H.c_1.g$, $g \in C$, where
$c_1$ is a normalized circuit. □

Informally, if a normalized circuit starts with SH then
we can force it into a normalized presentation that starts
with H.
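The brute-force identity used in the proof of Lemma 1, SHTH = HSHT.HSS, is easy to confirm numerically up to a global phase. The helper names in this short sketch are illustrative.

```python
import numpy as np

GATES = {
    "H": np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2),
    "S": np.diag([1, 1j]),
    "T": np.diag([1, np.exp(0.25j * np.pi)]),
}

def circuit(word):
    # Multiply the gates of a word from left to right.
    m = np.eye(2, dtype=complex)
    for ch in word:
        m = m @ GATES[ch]
    return m

def same(a, b):
    # Equality of 2x2 unitaries up to a global phase.
    return abs(abs(np.trace(a.conj().T @ b)) - 2.0) < 1e-9

lemma_identity_holds = same(circuit("SHTH"), circuit("HSHT" + "HSS"))
print(lemma_identity_holds)
```

Both sides contain a single T gate, consistent with the claim that the rewrite preserves T-count.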

We are now ready to prove Propositions 2 and 4. 

Proof. Let $U = [H.]c.g$ be a normalized representation of
a given $U \in PSU(2)$. Note that c may start with the
SH syllable, in which case we split it off. Now consider
$U = [H.][SH.]c.g$, where c is a normalized circuit
starting with the TH syllable. Further proof is based on the
following identities that can be established by brute-force
calculation in PSU(2):

$THSHT = G_2.THT.G_4$;
$THTHSHT = G_3.THTHT.G_4$;
$THSHTHT = G_{10}.THTHT.G_{11}$;
$THSHTHSHT = G_2.THTHT.G_5$;
$THTHTHSHT = G_{11}.THTHTHT.G_4$;
$THTHSHTHT = G_{12}.THTHTHT.G_{11}$;
$THSHTHTHT = G_4.THTHTHT.G_{12}$;
$THTHSHTHSHT = G_3.THTHTHT.G_5$;
$THSHTHSHTHT = G_5.THTHTHT.G_3$;
$THSHTHTHSHT = G_{10}.THTHTHT.G_{10}$;
$THSHTHSHTHSHT = G_2.THTHTHT.G_2$.

Informally, these are used to "squeeze" SH syllables
out of the first four syllables of c into surrounding
C factors. If c has fewer than five TH syllables,
we immediately obtain $U = g_1.c'.g_2$, $g_1, g_2 \in C$, where
c' is a canonical circuit. We now assume that c has
T-count t > 4 and that the propositions have been
proven for all T-counts smaller than t. Consider the
shortest prefix of the circuit c spanned by its leftmost
four TH syllables and apply one of the above
transformation rules to that prefix, thus obtaining a reduction
of the form $U = g_1.THTHTHT.g'.c'.g$, $g_1, g', g \in C$,
where c' is a normalized circuit. Apply Proposition 1
to the subcircuit g'.c'.g to obtain a normalized presentation
$V = [H.][SH.]c''.g''$, $g'' \in C$, where c'' is a normalized
circuit that is either empty or starts with TH. If c'' is
empty, we trivially get the canonical presentation
$U = g_1.THTHTHTH.(H.[H.][SH.]g'')$. Otherwise,
we need to consider the following three cases:

1. V starts with H. This yields the canonical presentation
$U = g_1.THTHTHTH.[SH.]c''.g''$;

2. V starts with SH. As per Lemma 1, we can force it
to start with H and reduce to the first case.

3. V starts with TH, i.e., $V = TH.c'''.g''$,
hence $U = g_1.THTHTHT.TH.c'''.g'' =
g_1.THTHT.HSH.c'''.g''$, where $THTHTHSH.c'''$
is normalized with T-count smaller than t. The
latter is not canonical, since there is an SH
occurring earlier than the fifth syllable; however,
the circuit is normalized with T-count smaller
than t and can be recursively brought to canonical
form as per the induction hypothesis.

Note that the last case is the only one responsible for
the potentially quadratic cost of the canonical reduction.
Normalization of subcircuits of the above g'.c'.g form has
linear cost. For the overall cost to become quadratic, the
circuit shape as in clause 3 must occur O(t) times in the
at most t/2 recurring rewrites, which is fairly unlikely.
In practice, we have never seen clause 3 invoked
in our experiments. □
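The "squeezing" identities used in this proof are stated as brute-force results, and a search of that kind can be sketched as follows. Given a word containing SH syllables and the corresponding pure TH core, the code looks for every pair (i, j) with word = $G_i$.core.$G_j$ up to a global phase, using the Clifford numbering of Appendix A. The function name `squeeze` and the word encoding are assumptions made for this illustration.

```python
import numpy as np

GATES = {
    "H": np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2),
    "S": np.diag([1, 1j]),
    "T": np.diag([1, np.exp(0.25j * np.pi)]),
}
WORDS = ["", "H", "HSSH", "SS", "S", "SSS", "HSS", "SSH", "SH", "SSSH",
         "SSHSSH", "SHSSH", "SSSHSSH", "HS", "HSSS", "SSHSS", "SHSS",
         "SSSHSS", "HSH", "HSSSH", "HSHSSH", "HSSSHSSH", "SSSHS", "SHSSS"]

def circuit(word):
    # Multiply the gates of a word from left to right.
    m = np.eye(2, dtype=complex)
    for ch in word:
        m = m @ GATES[ch]
    return m

C = [circuit(w) for w in WORDS]              # C[i] is the matrix of G_i

def same(a, b):
    # Equality of 2x2 unitaries up to a global phase.
    return abs(abs(np.trace(a.conj().T @ b)) - 2.0) < 1e-9

def squeeze(word, core):
    # All (i, j) such that word = G_i . core . G_j in PSU(2).
    target, middle = circuit(word), circuit(core)
    return [(i, j) for i in range(24) for j in range(24)
            if same(C[i] @ middle @ C[j], target)]

for word, core in [("THSHT", "THT"), ("THSHTHSHT", "THTHT"),
                   ("THSHTHSHTHSHT", "THTHTHT")]:
    print(word, "->", core, squeeze(word, core))
```

A short core may admit more than one pair, so the search returns a list; for instance, the pair (2, 4) found for THSHT corresponds to the first identity, $THSHT = G_2.THT.G_4$.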



Appendix E: Proof of Theorem 1 

We outline a proof by induction of Theorem 1. It is 
reminiscent of Sec. 4.2 in albeit dramatically simpler 
and shorter. 

Proof. The simple initial step is to note that if there exist
$g_1, g_2, c_1, c_2$ such that $c_2 = g_1.c_1.g_2$ as matrices and
$c_2 \neq c_1$ as circuits, then there exists a normalized
circuit n, with T-count(n) > 0, that evaluates to a matrix
in C. Since $SH \in C$ and T-count(SH) = 0, n, without loss
of generality, starts with TH.

Now consider the adjoint action of PSU(2) on its Lie
algebra $L = su(2)$, $ad_u[m] = u.m.u^{\dagger}$, $u \in PSU(2)$,
$m \in L$. It is a well-known fact that L consists of
zero-trace Hermitian matrices and is spanned over $\mathbb{R}$
by the Pauli matrices X, Y, Z.

The adjoint action of the C subgroup on L is the
symmetry group of the octahedron with vertices at
$\pm X, \pm Y, \pm Z$. In particular, for each $g \in C$,
$ad_g[Z]$ must be one of these vertices. To obtain a
contradiction it suffices to show that for a normalized
circuit n, $ad_n(Z)$ cannot be in $\{\pm X, \pm Y, \pm Z\}$.

Let $A \in L$ be a matrix over $\mathbb{Q}(\sqrt{2})$ represented as:

$$(\sqrt{2})^{l} A = (x_0 + x_1\sqrt{2})X + (y_0 + y_1\sqrt{2})Y + (z_0 + z_1\sqrt{2})Z,$$

where $x_0, x_1, y_0, y_1, z_0, z_1$ are integers.

We show that if $A = ad_n(Z)$ then (1) $x_0$ is odd and
(2) $y_0, z_0$ have opposite parity. Property (1) implies that
the coefficient at X is non-zero and property (2) implies
that at least one other coefficient (at Y or at Z) is
non-zero; together they imply that $ad_n(Z)$ cannot be
proportional to any one Pauli matrix.

We prove the desired properties (1) and (2) by induction
on the T-count of n. By direct computation:

$$ad_{TH}(X) = Z,$$
$$ad_{TH}(Y) = (X - Y)/\sqrt{2},$$
$$ad_{TH}(Z) = (X + Y)/\sqrt{2},$$
$$ad_{SHTH}(X) = Y,$$
$$ad_{SHTH}(Y) = (-X + Z)/\sqrt{2},$$
$$ad_{SHTH}(Z) = (X + Z)/\sqrt{2},$$

and, in particular, properties (1) and (2) hold for
$ad_{TH}(Z) = (X + Y)/\sqrt{2}$ ($x_0 = 1$, $y_0 = 1$, $z_0 = 0$).

Given matrix $A \in L$ presented as shown above, we
have:

$$(\sqrt{2})^{l+1} ad_{TH}(A) = ((y_0 + z_0) + (y_1 + z_1)\sqrt{2})X + ((z_0 - y_0) + (z_1 - y_1)\sqrt{2})Y + (2x_1 + x_0\sqrt{2})Z,$$

$$(\sqrt{2})^{l+1} ad_{SHTH}(A) = ((z_0 - y_0) + (z_1 - y_1)\sqrt{2})X + (2x_1 + x_0\sqrt{2})Y + ((y_0 + z_0) + (y_1 + z_1)\sqrt{2})Z.$$



By the induction hypothesis, $y_0, z_0$ have opposite parity,
therefore the new $x_0$, which is equal to either $y_0 + z_0$
or $z_0 - y_0$, is odd in both cases. In the expression for
$ad_{TH}(A)$, the new $y_0' = z_0 - y_0$ is odd but the new
$z_0' = 2x_1$ is even. In the expression for $ad_{SHTH}(A)$,
the new $y_0' = 2x_1$ is even but the new $z_0' = y_0 + z_0$
is odd.

Since each non-trivial normalized circuit is either
$n_1.TH$ or $n_1.SHTH$, where $n_1$ is a shorter normalized
circuit, this concludes the inductive proof. □
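The parity argument can be replayed with exact integer arithmetic. The sketch below stores the six integers $(x_0, x_1, y_0, y_1, z_0, z_1)$, applies the two update rules for $ad_{TH}$ and $ad_{SHTH}$ starting from A = Z, checks properties (1) and (2) after every step, and cross-checks the final coefficients against a floating-point evaluation. Each step composes the new syllable on the left, which is how the update formulas act; the function names are illustrative.

```python
import random
import numpy as np

SQRT2 = np.sqrt(2.0)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.diag([1.0 + 0j, -1.0])
H = (X + Z) / SQRT2
S = np.diag([1, 1j])
T = np.diag([1, np.exp(0.25j * np.pi)])
TH = T @ H
SHTH = S @ H @ T @ H

def step_th(v):
    # Integer update for (sqrt 2)^(l+1) ad_TH(A).
    x0, x1, y0, y1, z0, z1 = v
    return (y0 + z0, y1 + z1, z0 - y0, z1 - y1, 2 * x1, x0)

def step_shth(v):
    # Integer update for (sqrt 2)^(l+1) ad_SHTH(A).
    x0, x1, y0, y1, z0, z1 = v
    return (z0 - y0, z1 - y1, 2 * x1, x0, y0 + z0, y1 + z1)

def follow(syllables):
    # Returns (coefficients, parity properties hold, integers match numerics).
    v = (0, 0, 0, 0, 1, 0)                   # A = Z with l = 0
    u = np.eye(2, dtype=complex)
    invariant_ok = True
    for name in syllables:
        v = step_th(v) if name == "TH" else step_shth(v)
        u = (TH if name == "TH" else SHTH) @ u
        x0, _, y0, _, z0, _ = v
        invariant_ok = invariant_ok and x0 % 2 == 1 and (y0 + z0) % 2 == 1
    a = u @ Z @ u.conj().T                   # numerical ad_u(Z)
    scale = SQRT2 ** len(syllables)
    numeric = [scale * (np.trace(a @ p) / 2).real for p in (X, Y, Z)]
    exact = [v[0] + v[1] * SQRT2, v[2] + v[3] * SQRT2, v[4] + v[5] * SQRT2]
    return v, invariant_ok, bool(np.allclose(numeric, exact, atol=1e-8))

rng = random.Random(0)
sample = [rng.choice(["TH", "SHTH"]) for _ in range(8)]
print(sample, follow(sample))
```

Since the updated $x_0$ is always odd, the X coefficient never vanishes, which is the obstruction to landing on a vertex of the octahedron.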



